Abstract

We introduce a stochastic model which describes diffusions of tweets on the 
Twitter network. By dividing the followers into generations, we describe 
the dynamics of the tweet diffusion as a random multiplicative process. We 
confirm our model by directly observing the statistics of the multiplicative 
factors in the Twitter data. 
1. Introduction

Twitter, a popular microblogging web service, has attracted significant
attention. An important concern for Twitter users is how well their
writings, or tweets, spread through the network. Its significance is
obvious from the facts that counters which measure the popularity of
tweets are everywhere on the web and that many companies use Twitter as an
advertising tool. Many data related to Twitter have already been presented
in the literature [1, 2, 3, 4, 5, 6, 7].

On Twitter, users can follow other users and read their tweets without any
approval, thereby constructing a directed network among them. Users can
also propagate tweets to their followers thanks to a characteristic function
called the retweet, which results in information diffusion. The aim of the
present paper is to introduce a stochastic model for the diffusion of
daily tweets on the Twitter network. We investigate daily tweets because
we can expect a universal behavior; had we selected tweets with specific
keywords, we might observe irregular behaviors depending on the
characteristics of the keywords. As we will see, the diffusion of daily
tweets seems to have a random multiplicative process as its underlying
mechanism.

Figure 1: (Color online) Diffusion network on Twitter. The node at the center represents
the seed account and the linked nodes are the followers. A solid line means that the tweet
has diffused through the link by a retweet. In the model analysis, we ignore the
over-counting of followers such as the one illustrated by the wavy line, while we take it
into account in the data analysis.

We believe that the information diffusion on Twitter is different from
that of some other web contents discussed in the literature. It is known
that the total spread of the information of many web contents obeys
lognormal distributions. There, one can think of a discrete-time
random multiplicative process; the additional number of spreads in one time
step depends on the total number of spreads by the previous time steps
and a decaying function of time, i.e., $N_{t+1} = N_t (1 + r_t X_t)$, where
$t$ is the discrete time, $N_t$ is the total number of spreads by time $t$,
$X_t$ is a positive stochastic variable, and $r_t$ is the decaying function.
This is a natural model for featured contents and contents that people
search for. In the case of Twitter, however, it is unnatural to assume such
a mechanism because we expect that tweets diffuse through the followers and
the probability of the retweet activity should not depend on the total
spread; users usually do not look for tweets to retweet. They retweet
because they receive the tweets.
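
For comparison with the generation-based model introduced later, the time-based recursion $N_{t+1} = N_t (1 + r_t X_t)$ can be sketched as follows. This is only an illustration: the geometric decay `decay ** t` for $r_t$ and the lognormal draw for $X_t$ are assumptions made here, and the function name `simulate_spread` is hypothetical.

```python
import random


def simulate_spread(n0, steps, decay=0.5, seed=0):
    """Iterate N_{t+1} = N_t * (1 + r_t * X_t) and return [N_0, ..., N_steps]."""
    rng = random.Random(seed)
    totals = [float(n0)]
    for t in range(steps):
        r_t = decay ** t                     # assumed decaying function of time
        x_t = rng.lognormvariate(0.0, 1.0)   # assumed positive stochastic variable
        totals.append(totals[-1] * (1.0 + r_t * x_t))
    return totals


if __name__ == "__main__":
    for t, total in enumerate(simulate_spread(10, 5)):
        print(t, round(total, 2))
```

Because $r_t X_t$ is positive, the total spread never decreases, and each increment is proportional to the spread accumulated so far, which is exactly the dependence argued to be unnatural for retweets.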
In the present paper, we propose to classify the followers into
generations depending on the distance from the seed account and consider a
stochastic process along them. The diffusion process which we present would
probably occur in many other networks, especially on the web. An
advantage of researching the Twitter data is that we can confirm our model
directly thanks to the Twitter API [12], which provides rich information.

The present paper is organized as follows. In Sec. 2, we will introduce
a stochastic model of the tweet diffusion along the generations. In Sec. 3,
we will confirm that our model is indeed plausible by directly observing the
stochastic variables of the model in the Twitter data. Note that we will fix
a seed account when we analyze the data and there are some restrictions on
the selection of the seed account (see Sec. 3.1). Finally, after summarizing
the present paper, we will argue what can be further expected in Sec. 5.

2. Model 

Before we construct a model, let us first explain in detail how a tweet
diffuses by retweets on Twitter. Figure 1 shows a schematic picture of the
tweet diffusion process. Whenever a user generates a tweet, it will be sent
to the $N_0$ followers of the tweet owner, whom we call users in the zeroth
generation. Next, when $n_1$ users out of the $N_0$ followers retweet, the
original tweet will be sent to the followers of the $n_1$ retweeters; we
call them users in the first generation. We label the number of the
receivers in the first generation as $N_1$. Such a chain of diffusion of a
tweet continues until people stop retweeting or all the followers in the
last generation are users who already received the tweet [13]. We will
refer to the total number of receivers as
$N_{\mathrm{tot}} = \sum_{g \ge 0} N_g$ and the total retweet count as
$n_{\mathrm{RT}} = \sum_{g \ge 1} n_g$, where $g$ stands for the label of
the generation. While $N_0$ is simply the number of the followers of the
seed account, $N_g$ for $g \ge 1$ reads

$$N_g = \sum_{f=1}^{n_g} k_f - c_g, \qquad (1)$$

where $f$ stands for the label of each retweeter and $k_f$ stands for the
number of his or her followers. The factor $c_g$ is the number of
over-counted followers (e.g., the wavy line in Fig. 1). In the case where
the network is close to a tree structure and the distribution of the number
of followers is homogeneous, we can employ the approximation

$$N_g \simeq n_g \sum_{k=0}^{\infty} k \, p_g(k) =: n_g \bar{k}_g, \qquad (2)$$

where $k$ and $p_g(k)$ are the number of the followers of the retweeters in
the $(g-1)$th generation and its distribution, respectively.
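
The difference between the exact receiver count with the over-counting correction and its tree-like approximation can be illustrated with a small sketch. The follower counts and the over-count below are made-up numbers, and the function names are hypothetical.

```python
def receivers_exact(follower_counts, overcount):
    """Exact count: sum of the retweeters' follower numbers k_f minus c_g."""
    return sum(follower_counts) - overcount


def receivers_tree_approx(n_retweeters, mean_followers):
    """Tree-like approximation: number of retweeters times the mean follower number."""
    return n_retweeters * mean_followers


# Illustrative (made-up) generation with three retweeters.
followers = [120, 80, 400]
exact = receivers_exact(followers, overcount=30)
approx = receivers_tree_approx(len(followers), sum(followers) / len(followers))
print(exact, approx)
```

In this toy case the approximation differs from the exact count only by the over-counted followers, because the mean follower number is taken from the same three retweeters.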

Let us next estimate the number of the retweeters, $n_g$. Since there are
$N_{g-1}$ candidates to generate the retweeters in the $g$th generation, we
assume

$$n_g = \beta_g N_{g-1}, \qquad (3)$$

where $\beta_g$ is a variable which we call the retweet rate. Although
$\beta_g$ is a discrete variable because $n_g$ and $N_{g-1}$ are integers,
we treat it as if it were a continuous variable. In Sec. 3, we will observe
that the retweet rate has a distribution over many incidents of tweet
diffusion. We therefore regard $\beta_g$ as a continuous stochastic variable
hereafter.
Combining Eqs. (2) and (3), we have

$$N_m = J_m N_{m-1} = \cdots = \prod_{g=1}^{m} J_g N_0, \qquad (4)$$

$$n_m = \beta_m N_{m-1} = \cdots = \beta_m \prod_{g=1}^{m-1} J_g N_0, \qquad (5)$$

where

$$J_g = \beta_g \bar{k}_g, \qquad (6)$$

which is a stochastic variable because $\beta_g$ is a stochastic variable.
Although the probability distribution of $J_1$ may strongly depend on the
characteristics of the seed account, $J_g$ for $g \ge 2$ are expected to
obey a common probability distribution. Therefore, the number of viewers of
the tweet in each generation, $N_g$, is expressed as a random multiplicative
process because of the hierarchical structure of the followers. Note that
the present model is not a standard percolation model, which assigns a
stochastic variable to each follower, but a stochastic process with respect
to each generation. We do not consider the time dependence of the retweet
rates since most of the tweets finish diffusing very quickly [1].

In the following section, we will directly observe the statistics of the
retweet rates $\beta_g$ and confirm that our modeling is indeed plausible.
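
The random multiplicative process along the generations can be sketched in a few lines. This is a minimal illustration rather than the procedure used in the paper: the mean follower number is taken to be the same in every generation, the retweet rates are drawn from a lognormal distribution with illustrative parameters, and the names `diffuse` and `sample_betas` are hypothetical.

```python
import random


def diffuse(n0, betas, k_bar):
    """Return (receivers, retweeters): N_0..N_m and n_1..n_m for the given rates."""
    receivers = [float(n0)]
    retweeters = []
    for beta_g in betas:
        n_g = beta_g * receivers[-1]     # n_g = beta_g * N_{g-1}
        retweeters.append(n_g)
        receivers.append(n_g * k_bar)    # N_g ~ n_g * k_bar (tree-like approximation)
    return receivers, retweeters


def sample_betas(generations, mu, sigma, seed=0):
    """Draw one lognormal retweet rate per generation (illustrative parameters)."""
    rng = random.Random(seed)
    return [rng.lognormvariate(mu, sigma) for _ in range(generations)]


if __name__ == "__main__":
    N, n = diffuse(1_000_000, sample_betas(3, mu=-9.6, sigma=0.8), k_bar=500)
    print([round(x, 2) for x in N])
    print([round(x, 2) for x in n])
```

Each generation multiplies the previous receiver count by one stochastic factor, so the stochastic variable is attached to the generation and not to the individual follower.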
Figure 2: (Color online) The histograms of the retweet rates and the normal Q-Q plots
of the logarithms of the retweet rates. (a) and (c) are of $\beta_1$ and $\beta_2$ for
@Reuters; (b) and (d) are of $\beta_1$ and $\beta_2$ for @nytimes.



3. Data analysis for the retweet rate $\beta_g$



Using the data sampled with the Twitter API [12], we directly observed
the behaviors of $\beta_1$ and $\beta_2$. We chose The New York Times
(@nytimes) and Reuters Top News (@Reuters) for the seed accounts and sampled
the diffusion data with > 0. The data are summarized in Table 1.

3.1. Possible errors, selection of the seed accounts, and restrictions 

There are some inevitable errors in our data. We cannot sample the data
of private users and there might be some miscounts in $n_g$ because the
follow-followed relations might have changed by the time we sampled the
data. In order to sample the data as accurately as possible, we need to
select the seed accounts carefully; we chose seed accounts which tweet
frequently and whose numbers of followers are not changing rapidly, so that
we can expect the network around the seed account to be almost static during
the period of sampling. In order to see the statistical behavior clearly, it
is good to choose an account with a large number of followers and high
retweet rates [14]. When we analyze the retweet rate $\beta_g$, we take into
account the factor of over-counting $c_g$ in Eq. (1), and thus we assume
neither a tree structure nor the homogeneity of the distribution of the
followers.

3.2. Result 

Figures 2(a) and 2(b) show the histograms of $\beta_1$ and their normal Q-Q
plots [15]. They show that the retweet rate $\beta_1$ seems to obey lognormal
distributions with slight additive shifts, i.e.

$$\beta_1 = e^{w_1} + \delta_1, \qquad (7)$$

where $w_1$ obeys a Gaussian distribution $\mathcal{N}(\mu_1, \sigma_1^2)$
with $\mu_1$ being the mean and $\sigma_1^2$ being the variance for $g = 1$.
For $\beta_1$, the mean $\mu_1$ and the variance $\sigma_1^2$ seem to depend
strongly on the character of the seed account. The slight additive shift
might be due to the systematic activities of Twitter bots.
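
One simple way to obtain the Gaussian parameters of a shifted lognormal variable, once a shift has been chosen, is to take the mean and variance of the logarithm of the shifted samples. The sketch below shows this under the assumption that the shift is given in advance; the fitting procedure actually used for the paper is not described in the text, and the name `fit_shifted_lognormal` is hypothetical.

```python
import math
import statistics


def fit_shifted_lognormal(samples, shift):
    """Estimate (mu, sigma^2) of w = ln(x - shift) for samples x > shift."""
    logs = [math.log(x - shift) for x in samples]
    return statistics.fmean(logs), statistics.pvariance(logs)


if __name__ == "__main__":
    # Synthetic retweet rates built from known w values and a made-up shift.
    shift = 1.0e-6
    rates = [math.exp(w) + shift for w in (-10.0, -11.0, -12.0)]
    print(fit_shifted_lognormal(rates, shift))
```

A normal Q-Q plot of the same logarithms is the natural visual check of whether the Gaussian description is adequate.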

We expect that the retweet rate $\beta_2$ also obeys lognormal distributions
with slight additive shifts. Figures 2(c) and 2(d) show the histograms of
$\beta_2$ and their normal Q-Q plots; they indeed indicate lognormal
behavior. For $\beta_2$, the mean $\mu_2$ and the variance $\sigma_2^2$ are
very close for both of the seed accounts; it seems plausible to model the
retweet rate $\beta_g$ as obeying a common probability distribution for
$g \ge 2$.

In Table 1, we list the averages of the over-counting of the users in the
first generation, i.e. $c_1 / \sum_{f=1}^{n_1} k_f$ in Eq. (1). The
over-counting of users is less than 5% on average, and thus the networks
around the seed accounts have almost tree structures. Although it is still
doubtful whether the tree-structure approximation is appropriate in all
generations, it is hard to imagine a drastic qualitative change to the
diffusion phenomenon due to the loop correction since there is no backflow.

Figure 3: (Color online) The histograms and the normal Q-Q plots of the logarithms of
the number of followers and the retweet count in the second generation. (a) and (c) are
of $N_1$ and $n_2$ for @Reuters; (b) and (d) are of $N_1$ and $n_2$ for @nytimes.

Since we are fixing the seed account, $N_0$ is a constant and the
distribution of $\beta_1$ is proportional to that of $n_1$. The number of
followers in the first generation, $N_1$, and the number of retweeters among
them, $n_2$, can take different values for each sample. As shown in
Figs. 3(c) and 3(d), both of them obey lognormal distributions and they are
not independent of each other. The correlation coefficients of $n_2$ and
$N_1$, $\rho(n_2, N_1)$, have large positive values (see Table 1); the
correlation coefficient varies from $-1$ to $1$ and is calculated by

$$\rho(n_2, N_1) = \frac{\langle n_2 N_1 \rangle - \langle n_2 \rangle \langle N_1 \rangle}{\sqrt{\langle n_2^2 \rangle - \langle n_2 \rangle^2} \, \sqrt{\langle N_1^2 \rangle - \langle N_1 \rangle^2}}. \qquad (8)$$
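
The correlation coefficient of Eq. (8) can be computed directly from sample averages. The sketch below uses the standard Pearson form with population (not bias-corrected) variances; the sample values are made up and the function names are hypothetical.

```python
import math


def mean(values):
    """Sample average <x>."""
    return sum(values) / len(values)


def correlation(xs, ys):
    """Pearson coefficient (<xy> - <x><y>) / (std_x * std_y) from sample averages."""
    mx, my = mean(xs), mean(ys)
    cov = mean([x * y for x, y in zip(xs, ys)]) - mx * my
    var_x = mean([x * x for x in xs]) - mx * mx
    var_y = mean([y * y for y in ys]) - my * my
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    # Made-up samples of (n_2, N_1) pairs, for illustration only.
    n2 = [3, 5, 2, 9, 4]
    N1 = [52000, 91000, 40000, 120000, 60000]
    print(round(correlation(n2, N1), 3))
```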



Our result that the retweet rate $\beta_2$ obeys a lognormal distribution is
plausible because lognormal distributions have the reproductive property,
i.e. for two stochastic variables $X_1$ and $X_2$ which obey lognormal
distributions,

$$p(\ln X_1) * p(\ln X_2) = \mathcal{N}(\mu_1, \sigma_1^2) * \mathcal{N}(\mu_2, \sigma_2^2) = \mathcal{N}(\mu_1 + \mu_2, \sigma_1^2 + \sigma_2^2), \qquad (9)$$

where $*$ stands for the convolution.
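
The convolution in Eq. (9) describes the product of two independent lognormal variables. For a ratio such as $n_2 / N_1$, the same argument applies to the difference of the logarithms, and a correlation between the logarithms changes only the variance. The sketch below assumes that the two logarithms are jointly Gaussian with correlation `corr` in log space; this correlation is an assumption of the sketch and is not the coefficient of $n_2$ and $N_1$ themselves. The function name is hypothetical.

```python
import math


def log_ratio_params(mu_num, var_num, mu_den, var_den, corr=0.0):
    """Mean and variance of ln(X/Y) when ln X and ln Y are jointly Gaussian.

    corr is the assumed correlation between ln X and ln Y (0 for independence).
    """
    mean = mu_num - mu_den
    var = var_num + var_den - 2.0 * corr * math.sqrt(var_num * var_den)
    return mean, var


if __name__ == "__main__":
    # Fitted log-space parameters of n_2 and N_1 for @Reuters, shifts neglected.
    print(log_ratio_params(0.48, 1.12, 10.64, 0.99))
```

With `corr=0.0` the variances simply add, as in Eq. (9); a positive correlation in log space makes the ratio narrower than the independent case.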



4. Model analysis: estimation of the diffusion range 

From the model which we introduced above, we can estimate how large a
retweet rate $\beta_{\mathrm{th}}(m)$ is required to reach the $m$th
generation on average and the average of the total number of retweets,
$\langle n_{\mathrm{RT}} \rangle$, for given parameters. In this section, we
restrict ourselves to the case where the retweet rates $\beta_g$ are
independent of each other and their averages have a common value
$\langle \beta \rangle$.

Table 1: Data of tweet diffusions from @Reuters and @nytimes. The angular bracket
$\langle \cdots \rangle$ stands for the sample average.

                                                        @Reuters                                        @nytimes
Number of seed tweets                                   1352                                            1140
Period                                                  from Jun. 26, 2012                              from Jun. 19, 2012
                                                        to Aug. 9, 2012                                 to Aug. 18, 2012
$N_0$                                                   1 940 477                                       5 882 680
$\langle n_{\mathrm{RT}} \rangle$                       74.0                                            97.3
$(\langle \beta_1 \rangle, \langle \beta_2 \rangle)$    $(3.27 \times 10^{-5}, 7.90 \times 10^{-5})$    $(1.43 \times 10^{-5}, 8.52 \times 10^{-5})$
$(\langle N_1 \rangle, \langle n_2 \rangle)$            (75 276, 4.46)                                  (76 533, 4.54)
$\rho(n_2, N_1)$                                        0.468                                           0.481
$\langle c_1 / \sum_{f=1}^{n_1} k_f \rangle$            0.0471                                          0.0470
Fitting parameters                                      $\mu_1 = -10.51$                                $\mu_1 = -11.51$
for $\beta_1 = e^{w_1} + \delta_1$,                     $\sigma_1^2 = 0.6$                              $\sigma_1^2 = 0.77$
$p(w_1) = \mathcal{N}(\mu_1, \sigma_1^2)$               $\delta_1 =$                                    $\delta_1 = 1.0 \times 10^{-6}$
Fitting parameters                                      $\mu_2 = -9.65$                                 $\mu_2 = -9.51$
for $\beta_2 = e^{w_2} + \delta_2$,                     $\sigma_2^2 = 0.68$                             $\sigma_2^2 = 0.69$
$p(w_2) = \mathcal{N}(\mu_2, \sigma_2^2)$               $\delta_2 = -1.0 \times 10^{-5}$                $\delta_2 = -1.0 \times 10^{-5}$
Fitting parameters                                      $\mu = 10.64$                                   $\mu = 10.58$
for $N_1 = e^{v} + \delta$                              $\sigma^2 = 0.99$                               $\sigma^2 = 1.04$
                                                        $\delta =$                                      $\delta = 2000$
Fitting parameters                                      $\mu = 0.48$                                    $\mu = 0.49$
for $n_2 = e^{u} + \delta$,                             $\sigma^2 = 1.12$                               $\sigma^2 = 1.23$
$p(u) = \mathcal{N}(\mu, \sigma^2)$                     $\delta = 0.5$                                  $\delta = 0.5$

According to Eq. (5), the average of the number of retweets in the $m$th
generation reads $\langle n_m \rangle = N_0 \bar{k}^{m-1} \langle \beta \rangle^m$,
where we take $\bar{k}_g = \bar{k}$ for all $g$. Then we have the threshold
for the retweet rate where the diffusion reaches the $m$th generation on
average, i.e. $\langle n_m \rangle \ge 1$:

$$\beta_{\mathrm{th}}(m) = (N_0 \bar{k}^{m-1})^{-1/m} = N_0^{-1/m} \bar{k}^{1/m - 1}. \qquad (10)$$

The behavior of Eq. (10) is exemplified in Fig. 4(a); in the case of the seed
accounts which we investigated, the tweets diffuse up to the second or the
third generation (see $\langle \beta_1 \rangle$ and $\langle \beta_2 \rangle$
in Table 1). While we employed the mean value of $n_m$ in the definition of
the threshold, it is also plausible to consider the median of $n_m$ instead.
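
The threshold of Eq. (10) is easy to evaluate numerically. The sketch below uses the follower numbers and second-generation average retweet rates listed in Table 1 together with the mean follower number 500 used in Fig. 4; the function names are hypothetical, and the search over generations is simply a loop that stops at the first threshold exceeding the given rate.

```python
def beta_threshold(m, n0, k_bar):
    """Threshold rate for reaching generation m: (N_0 * k_bar^(m-1))^(-1/m)."""
    return (n0 * k_bar ** (m - 1)) ** (-1.0 / m)


def reachable_generation(beta, n0, k_bar, max_m=100):
    """Largest m (up to max_m) whose threshold does not exceed beta; 0 if none."""
    reached = 0
    for m in range(1, max_m + 1):
        if beta < beta_threshold(m, n0, k_bar):
            break
        reached = m
    return reached


if __name__ == "__main__":
    # (N_0, average retweet rate of the second generation), with k_bar = 500.
    for name, n0, beta in [("@Reuters", 1_940_477, 7.90e-5),
                           ("@nytimes", 5_882_680, 8.52e-5)]:
        print(name, reachable_generation(beta, n0, 500))
```

The loop relies on the threshold increasing with $m$, which holds whenever $N_0 > \bar{k}$, as is the case for both seed accounts.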

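For a given range $M$ of the diffusion, it is straightforward to calculate
the average of the total number of retweets,

$$\langle n_{\mathrm{RT}} \rangle = \sum_{m=1}^{M} \langle n_m \rangle = N_0 \langle \beta \rangle \sum_{m=0}^{M-1} (\langle \beta \rangle \bar{k})^m = N_0 \langle \beta \rangle \frac{1 - (\langle \beta \rangle \bar{k})^M}{1 - \langle \beta \rangle \bar{k}}. \qquad (11)$$

The behavior of Eq. (11) is exemplified in Fig. 4(b); it shows that
$\langle n_{\mathrm{RT}} \rangle$ is not very sensitive to the diffusion
range $M$ in the case where $\langle \beta \rangle \bar{k}$ is small.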
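
The weak dependence on $M$ can be checked by evaluating Eq. (11) for a few ranges. The sketch below compares the closed form with the direct sum over generations; the parameter values are the @Reuters follower number and second-generation average rate from Table 1 with the mean follower number 500 of Fig. 4, and the function names are hypothetical.

```python
def mean_total_retweets(n0, beta, k_bar, M):
    """Closed form of Eq. (11): N_0 * beta * (1 - x^M) / (1 - x), with x = beta * k_bar."""
    x = beta * k_bar
    if x == 1.0:
        return n0 * beta * M  # limit of the geometric sum when x equals one
    return n0 * beta * (1.0 - x ** M) / (1.0 - x)


def mean_total_retweets_direct(n0, beta, k_bar, M):
    """Direct sum of <n_m> = N_0 * k_bar^(m-1) * beta^m over m = 1..M."""
    return sum(n0 * k_bar ** (m - 1) * beta ** m for m in range(1, M + 1))


if __name__ == "__main__":
    for M in (1, 2, 3, 10):
        print(M, round(mean_total_retweets(1_940_477, 7.90e-5, 500, M), 2))
```

Since $\langle \beta \rangle \bar{k}$ is about 0.04 for these values, the geometric series is dominated by its first term and the printed totals change little beyond $M = 2$.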

5. Conclusion and Discussion 

We decomposed the diffusion of daily tweets into the dynamics along
the generations of the followers; the dynamics of retweet activities can be
modeled as a random multiplicative process with respect to the generation.
We directly observed the multiplicative factors from the actual data of
Twitter and confirmed that our model is indeed plausible. We found that the
multiplicative factors roughly obey lognormal distributions. The important
point about our model is that the diffusion occurs owing to the repetition
of cooperative activities along the followers and thus, as far as we know, it
belongs to a different class of information diffusion compared to the ones in
the literature [8, 10]. We also believe that Twitter is not the only network
where such a diffusion process occurs.
Figure 4: (Color online) (a) Threshold where the diffusion reaches the $m$th generation on
average. We set $\bar{k} = 500$. The points are for @Reuters and @nytimes in the case where
we assumed $\langle \beta \rangle = \langle \beta_2 \rangle$. (b) The average of the total
number of retweets $\langle n_{\mathrm{RT}} \rangle$ as a function of the diffusion range $M$
for various values of the average retweet rate $\langle \beta \rangle$. We set
$\bar{k} = 500$ and plotted the cases where $\langle \beta \rangle$ equals
$\langle \beta_g \rangle$ of @Reuters and @nytimes. For $N_0$, we used the respective
values of these accounts.

The model of the present paper is a minimal model. In order to estimate 
the behavior of diffusion precisely, the approximation of the tree structure is 
obviously too rough; it is necessary to embed the information about the rate 
of over-counting and the heterogeneity of the network. We also neglected the 
correlation between the retweet rates. We will consider its effect in a future 
study which may be published elsewhere. 

For the accuracy of the data, we sampled the data of news accounts only
in the present paper. If the Twitter API were updated and we tried a
different way of sampling the data, we would be able to analyze the behaviors
of many other accounts. Then we could proceed to a more quantitative
analysis; for example, we would be able to measure the range of diffusion for
each diffusion process.